Randomized Language Models via Perfect Hash Functions

Authors

  • David Talbot
  • Thorsten Brants
Abstract

We propose a succinct randomized language model which employs a perfect hash function to encode fingerprints of n-grams and their associated probabilities, backoff weights, or other parameters. The scheme can represent any standard n-gram model and is easily combined with existing model reduction techniques such as entropy pruning. We demonstrate the space savings of the scheme via machine translation experiments within a distributed language modeling framework.
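The idea in the abstract, a perfect hash that maps each stored n-gram to its own slot holding a short fingerprint plus the associated value, can be illustrated with a minimal sketch. It is not the construction used in the paper: it swaps in a simple hash-and-displace style perfect hash (one seed per bucket, found by brute-force search), stores unquantized floats, and uses about 1.25 slots per key. The names `FingerprintLM`, `_h`, `fp_bits`, and `max_seed` are hypothetical and chosen for illustration only.

```python
import hashlib


def _h(key: str, seed: int) -> int:
    """Deterministic 64-bit hash of `key` under an integer seed (illustrative helper)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


class FingerprintLM:
    """Toy n-gram store: perfect hash -> slot holding (fingerprint, value).

    Assumption: hash-and-displace style construction, not the paper's own scheme.
    """

    def __init__(self, ngram_values: dict, fp_bits: int = 12, max_seed: int = 1 << 16):
        keys = list(ngram_values)
        self.fp_mask = (1 << fp_bits) - 1
        self.m = max(1, (len(keys) * 5 + 3) // 4)      # about 1.25 slots per key
        self.num_buckets = len(keys) // 4 + 1
        buckets = [[] for _ in range(self.num_buckets)]
        for k in keys:
            buckets[_h(k, 0) % self.num_buckets].append(k)

        self.seeds = [0] * self.num_buckets
        self.slots = [None] * self.m
        # Place the largest buckets first, while most slots are still free.
        for b in sorted(range(self.num_buckets), key=lambda i: -len(buckets[i])):
            if not buckets[b]:
                continue
            for seed in range(1, max_seed):
                pos = [_h(k, seed) % self.m for k in buckets[b]]
                if len(set(pos)) == len(pos) and all(self.slots[p] is None for p in pos):
                    for k, p in zip(buckets[b], pos):
                        self.slots[p] = (self._fp(k), ngram_values[k])
                    self.seeds[b] = seed
                    break
            else:
                raise RuntimeError("no collision-free seed found; raise max_seed")

    def _fp(self, ngram: str) -> int:
        # Fingerprint uses a seed (-1) never used for bucket or slot hashing.
        return _h(ngram, -1) & self.fp_mask

    def get(self, ngram: str):
        """Return the stored value, or None if the fingerprint does not match."""
        seed = self.seeds[_h(ngram, 0) % self.num_buckets]
        entry = self.slots[_h(ngram, seed) % self.m]
        if entry is not None and entry[0] == self._fp(ngram):
            return entry[1]
        return None


if __name__ == "__main__":
    lm = FingerprintLM({"the cat": -1.2, "cat sat": -0.7, "sat on": -0.9})
    print(lm.get("the cat"))   # a stored n-gram always returns its own value
    print(lm.get("dog barked"))  # unseen: usually None, rarely a false positive
```

The n-gram strings themselves are never stored, which is where the space saving comes from. The price is a one-sided error: a stored n-gram always returns its value, while an unseen n-gram is mistaken for a stored one with probability of roughly 2^-fp_bits when it lands on an occupied slot.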

Similar Resources

A Simulated Annealing Algorithm for Generating Minimal Perfect Hash Functions

We developed minimal perfect hash functions for a variety of datasets using the probabilistic process of simulated annealing (SA). The SA solution structure is a tree representing an annealed program (algorithm). This solution structure is similar to the structure used in genetic programming. When executed, the SA program produces multiple hash functions for the given data set. An initial hash ...

A Finite-State Library for NLP

A library of functions is described which use finite-state automata for compact storage and efficient usage of very large dictionaries and language models. The library can be used to test whether a word is in a dictionary, to perform morphological analysis, to construct perfect hash tables, and to construct and use very large language models (such as models which employ bigram and trigram frequ...

Generating Minimal Perfect Hash Functions

Randomized, deterministic and parallel algorithms for generating minimal perfect hash functions (MPHFs) are proposed. Given a set of keys, W, which are character strings over some alphabet, the algorithms use a three-step approach (mapping, ordering, searching) to find an MPHF of the form h(w) = (h0(w) + g(h1(w)) + g(h2(w))) mod m, w ∈ W, where h0, h1, h2 are auxiliary pseudorandom functions, m...

Stream-based Randomised Language Models for SMT

Randomised techniques allow very big language models to be represented succinctly. However, being batch-based they are unsuitable for modelling an unbounded stream of language whilst maintaining a constant error rate. We present a novel randomised language model which uses an online perfect hash function to efficiently deal with unbounded text streams. Translation experiments over a text stream...

On the Structure and Complexity of Infinite Sets with Minimal Perfect Hash Functions

This paper studies the class of infinite sets that have minimal perfect hash functions (one-to-one onto maps between the sets and Σ*) computable in polynomial time. We show that all standard NP-complete sets have polynomial-time computable minimal perfect hash functions, and give a structural condition sufficient to ensure that all infinite NP sets have polynomial-time computable minimal perfe...

Publication date: 2008